The data contains clear gaps, so NaN values need to be handled. The first heat plant, HS1, has the most available data, so HS1_T1, HS1_T2, and heat are among the best candidate inputs for predicting the temperature of substation 1. To keep the model statistically sound, outliers are filtered and the data is modeled per season. For simplicity, only two models are created, one for winter and one for summer, since heat demand and consumption increase during winter. Additionally, the third heat plant seems to operate only under specific, unknown conditions, so it is also worth analyzing whether it makes sense to filter out the data recorded while that plant is on.

This graph shows that some outliers need to be filtered. It is also helpful because it shows a clear seasonal distinction: the temperature stays around 60°C in summer and ranges between 70 and 90°C in winter. Periods with missing data appear as straight connecting lines, which can be ignored.

This graph also shows a clear seasonal distinction in heat transfer: values stay below 1500 in summer and reach as high as 8000 in winter. Periods with missing data again appear as straight connecting lines, which can be ignored.

These scatter plots show an approximately linear behavior for both parameters, as well as outliers that need to be filtered. The plots can also be read as containing two lines with different slopes, which can be separated by splitting the data into winter and summer. Furthermore, the scatter plot for heat shows a more linear behavior than the one for HS1_T1.
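The filtering method is not stated here, so the following is only a minimal sketch of one common option: dropping rows with missing values and then removing points outside an interquartile-range fence. The function name `drop_outliers_iqr`, the factor `k=1.5`, and the demo values are assumptions for illustration, not the data or the method actually used.

```python
import numpy as np
import pandas as pd

def drop_outliers_iqr(df, columns, k=1.5):
    """Drop rows with NaN or with values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    # Remove rows that have a missing value in any of the checked columns.
    clean = df.dropna(subset=columns)
    mask = pd.Series(True, index=clean.index)
    for col in columns:
        q1, q3 = clean[col].quantile([0.25, 0.75])
        iqr = q3 - q1
        # Keep only the rows that fall inside the fence for this column.
        mask &= clean[col].between(q1 - k * iqr, q3 + k * iqr)
    return clean[mask]

# Made-up demo values: one NaN and one obvious outlier (300.0).
demo = pd.DataFrame({
    "HS1_T1": [80.0, 82.0, 81.0, 79.0, 300.0, np.nan, 83.0],
    "node1_T1": [70.0, 71.0, 70.5, 69.5, 70.0, 70.0, 72.0],
})
filtered = drop_outliers_iqr(demo, ["HS1_T1", "node1_T1"])
```

A fence based on quartiles is less sensitive to the outliers themselves than a mean and standard deviation rule, which is why it is used in this sketch.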

The histogram shows that heating plant 3 is off most of the time. Therefore, all data recorded while the plant is on will be removed.
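Removing the rows where plant 3 is on could look like the sketch below. The column name `HS3_heat`, the threshold of zero, and the choice to treat missing readings as "off" are all assumptions, since the text does not say how the plant state is detected.

```python
import numpy as np
import pandas as pd

def keep_plant3_off(df, signal_col, on_threshold=0.0):
    """Return only the rows where heating plant 3 is considered off."""
    # Assumption: the plant is "on" whenever its signal exceeds on_threshold;
    # missing readings are treated as "off".
    is_on = df[signal_col].fillna(0.0) > on_threshold
    return df.loc[~is_on]

# Made-up demo values; "HS3_heat" is a hypothetical column name.
demo = pd.DataFrame({
    "HS3_heat": [0.0, 120.0, np.nan, 0.0],
    "node1_T1": [61.0, 75.0, 60.5, 62.0],
})
plant3_off = keep_plant3_off(demo, "HS3_heat")
```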

Both scatter plots now show a single linear trend. The temperature of the heating plant can be delayed, as the temperature in the substation does not rise at the same pace as that of the plant.

Separating the data sets using heat limits as the criterion seems to give a more linear behavior for both sets. This needs to be verified with the R2 score.
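A split by heat limit can be sketched as follows. The threshold of 1500 is only borrowed from the heat plot description above, where summer values stay below 1500. The limit actually used is not stated, so the threshold, the helper name `split_by_threshold`, and the demo values are assumptions.

```python
import pandas as pd

def split_by_threshold(df, column, threshold):
    """Split df into (low, high) parts around a threshold on one column."""
    # Rows without a value in the split column cannot be assigned to a season.
    valid = df.dropna(subset=[column])
    low = valid[valid[column] < threshold]
    high = valid[valid[column] >= threshold]
    return low, high

# Made-up demo values.
demo = pd.DataFrame({
    "heat": [400.0, 1200.0, 3500.0, 7800.0, None],
    "node1_T1": [60.0, 61.0, 75.0, 88.0, 70.0],
})
summer, winter = split_by_threshold(demo, "heat", threshold=1500.0)
```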

The correlation matrix for the filtered data shows that the best variable for substation 1 is heat. A value closer to unity means the variable represents the dependent variable more accurately. Even though the temperatures of the other substations have a value closer to 1, it does not make sense to consider them, as they do not supply heat to the network. Additionally, it is not realistic to expect a value of 1, given that other physical variables also affect the temperature at the substation. With this in mind, the most important parameters are heat, the temperature of the supply station, and the mass flow.
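Ranking candidate inputs by their correlation with the target could be sketched like this. The helper `rank_by_correlation` and the synthetic data are assumptions for illustration. Only the column names `heat` and `node1_T1` come from this analysis.

```python
import numpy as np
import pandas as pd

def rank_by_correlation(df, target):
    """Pearson correlation of every column with target, strongest first."""
    corr = df.corr(method="pearson")[target].drop(target)
    # Order by absolute value so strong negative correlations also rank high.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Synthetic data: the target depends on "heat" but not on "unrelated".
rng = np.random.default_rng(0)
heat = rng.uniform(2000.0, 8000.0, size=200)
demo = pd.DataFrame({
    "heat": heat,
    "unrelated": rng.normal(size=200),
    "node1_T1": 60.0 + 0.003 * heat + rng.normal(size=200),
})
ranking = rank_by_correlation(demo, "node1_T1")
```

As noted above, a high coefficient alone is not enough: variables that do not physically supply heat to the network still have to be excluded by hand.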

It makes sense to delay the supply temperature. In contrast, delaying heat transfer makes little sense, as power is already a measure of the rate of energy transfer. The bar plot shows that the Pearson coefficient is highest for delays of 3-4 hours.
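The delay comparison can be sketched by shifting the supply temperature and recomputing the Pearson coefficient for each lag. This assumes one row per hour, so a shift of one row equals a delay of one hour. The function name `lagged_pearson` and the synthetic series, built with a known 3-row delay, are illustrative assumptions.

```python
import numpy as np
import pandas as pd

def lagged_pearson(source, target, max_lag):
    """Pearson r between target and source delayed by 0..max_lag rows."""
    # Series.corr ignores the NaN rows that shift() introduces at the start.
    return pd.Series(
        {lag: target.corr(source.shift(lag)) for lag in range(max_lag + 1)}
    )

# Synthetic series: the target follows the source with a 3-row delay.
rng = np.random.default_rng(0)
hs1_t1 = pd.Series(rng.normal(80.0, 5.0, size=500))
node1_t1 = hs1_t1.shift(3) * 0.8 + rng.normal(0.0, 0.5, size=500)

scores = lagged_pearson(hs1_t1, node1_t1, max_lag=6)
best_lag = scores.idxmax()
```

The delayed column itself (for example HS1_T1_del_4) would then simply be the original column shifted by the chosen number of rows.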

For summer, no delay is considered, given that consumption is considerably lower. For winter, a delay of 4 hours is used because of its higher Pearson coefficient and the possibly slower temperature changes across the network as a whole.

The score and the plot show that the linear regression is quite inaccurate for the summer model, with an R2 of barely 2%. This means the temperature of the first heating plant cannot linearly describe the temperature at the first substation.

Using heat as the independent variable is still not accurate, but it is clearly better than using temperature 1 of the first heating plant, as R2 increases from 1.9% to 9.2%. The scatter plot shows that the data is still too dispersed, and a single-input linear model is not enough to make accurate predictions.
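For reference, a single-input fit and its R2 score can be sketched with NumPy alone. The library and the train/test setup behind the scores above are not stated, so this sketch simply computes R2 on the same points used for the fit. The helper name `fit_line_r2` and the demo data are assumptions.

```python
import numpy as np

def fit_line_r2(x, y):
    """Fit y = intercept + slope * x and return (slope, intercept, r2)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    pred = intercept + slope * x
    ss_res = np.sum((y - pred) ** 2)        # unexplained variation
    ss_tot = np.sum((y - y.mean()) ** 2)    # total variation
    return slope, intercept, 1.0 - ss_res / ss_tot

# Made-up demo: an exactly linear relation and a heavily scattered one.
x = np.linspace(0.0, 10.0, 50)
slope, intercept, r2_clean = fit_line_r2(x, 3.0 + 2.0 * x)

rng = np.random.default_rng(0)
_, _, r2_noisy = fit_line_r2(x, 3.0 + 2.0 * x + rng.normal(0.0, 30.0, size=50))
```

An R2 near zero, as in the summer models, means the fitted line explains little more than predicting the mean temperature would.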

As expected, the winter model is more accurate, since the scatter plot shows a more obvious linear trend between the dependent and independent variables. With supply temperature 1 as the input, an R2 of 65% is a relatively high score compared to the summer model.

The winter model using heat is less accurate than the one using supply temperature 1: its R2 is about 10 percentage points lower, at 54%. Based on the R2 scores, the best single-input models for predicting node1_T1 are the summer model with heat as input and the winter model with HS1_T1 as input.

The additional parameters selected are those with the highest Pearson coefficients: heat, HS1_G2, and HS2_G1.

The linear regression equation has the form:

y = 51.50364601630728 + 0.00219089(heat) + 0.09158439(HS1_T1) + 0.00013776(HS1_G2) - 0.00035828(HS2_G1)

The score of the summer model, with data filtered by the substation's temperature, increases significantly from 2% to 46% after adding inputs with a higher Pearson correlation.
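A multiple-input fit of this kind can be sketched as an ordinary least-squares problem with an intercept column. This is not necessarily how the coefficients above were obtained. The helper name `fit_multilinear` and the synthetic data with known coefficients are assumptions used only to show the mechanics.

```python
import numpy as np

def fit_multilinear(X, y):
    """Least-squares fit of y = intercept + X @ coefs; returns (intercept, coefs, r2)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    # Prepend a column of ones so the first fitted weight is the intercept.
    A = np.column_stack([np.ones(len(X)), X])
    weights, *_ = np.linalg.lstsq(A, y, rcond=None)
    pred = A @ weights
    r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
    return weights[0], weights[1:], r2

# Synthetic data with four inputs and known coefficients.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
true_coefs = np.array([1.0, 2.0, -3.0, 0.5])
y = 5.0 + X @ true_coefs

intercept, coefs, r2 = fit_multilinear(X, y)
```

In-sample R2 cannot decrease when an input is added, so part of the gain from extra inputs is expected even when they carry little information.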

When heat is used to filter the data, the correlations for all parameters are quite low, which leads to a less accurate model. The matrix does show, however, that the correlation for heat is 0.3, whereas the correlation for HS1_T1 in the previous matrix is 0.14. This explains the results of the earlier scatter plots and their linear regression models, where the more accurate model used heat.

The linear regression equation has the form:

y = 50.42344013295554 + 0.00338078(heat) + 0.08888249(HS2_T1) + 0.00026093(HS1_G2) - 0.0002713(HS1_G1)

As mentioned before, this model remains quite inaccurate even with the new inputs, as the score only increases from 9% to 16%. By comparison, the four-input summer model with data filtered by the substation's temperature reaches an R2 of 46%.

As seen from the previous data, the correlations for the winter model are quite high. The parameters with the highest correlations are heat, HS1_T1_del_4, HS1_T2, and HS2_T1. Even though the correlation for HS1_T1 is also high, it is not considered because its delayed version is already included.

The linear regression equation has the form:

y = 18.769292785207163 + 3.81579495e-04(heat) + 3.26788154e-01(HS1_T1_del_4) + 4.88086021e-01(HS1_T2) + 1.43602310e-01(HS2_T1)

The score of the winter model increases from 65% to 71%.

When the winter data is filtered using heat, the parameters with the highest correlations are again heat, HS1_T1_del_4, HS1_T2, and HS2_T1.

The linear regression equation has the form:

y = 11.718470378984144 + 0.00106214(heat) + 0.37738452(HS1_T1_del_4) + 0.48895896(HS1_T2) + 0.14020258(HS2_T1)

The model's score increases from 54% to 75%, the highest accuracy of all models.

Conclusion

The most accurate winter model is obtained when the data is filtered by heat, using the inputs heat, HS1_T1_del_4, HS1_T2, and HS2_T1. The most accurate summer model is obtained when the data is filtered by the substation's temperature, using the inputs heat, HS1_G2, HS2_G1, and HS1_T1. Additionally, the models with multiple inputs are more accurate than those with a single input, as these exercises show. Lastly, for the multiple-input models, the prediction equations are provided instead of plots, since plots with 4 dimensions are harder to visualize and understand than 2D graphs.
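The two selected equations can be written as small functions, which makes them easier to apply to new readings. The coefficients are copied from the equations reported above. The function and argument names are assumptions, and the inputs are expected in the same units as the original columns.

```python
def predict_winter(heat, hs1_t1_del_4, hs1_t2, hs2_t1):
    """Winter model (data filtered by heat), coefficients as reported above."""
    return (11.718470378984144
            + 0.00106214 * heat
            + 0.37738452 * hs1_t1_del_4
            + 0.48895896 * hs1_t2
            + 0.14020258 * hs2_t1)

def predict_summer(heat, hs1_t1, hs1_g2, hs2_g1):
    """Summer model (data filtered by substation temperature), as reported above."""
    return (51.50364601630728
            + 0.00219089 * heat
            + 0.09158439 * hs1_t1
            + 0.00013776 * hs1_g2
            - 0.00035828 * hs2_g1)

# With all inputs at zero, each model returns its intercept.
winter_intercept = predict_winter(0.0, 0.0, 0.0, 0.0)
summer_intercept = predict_summer(0.0, 0.0, 0.0, 0.0)
```

Which of the two functions to call for a given reading depends on the same filter criterion that was used to split the training data.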